ABSTRACT

This paper posits a taxonomy for categorizing issues 
that arise in the evaluation of Intelligent Tutoring Systems (ITSs).
The taxonomy has three dimensions: Life Cycle of Evaluation, Research 
Issues, and Methodological Issues. The Life Cycle dimension has four 
levels: pre-experimental, laboratory study, field study, and initial
operational test and evaluation. The three levels of the Research 
Issues dimension — functionality, effectiveness, and cost — are 
subsequently further divided into several sublevels. The 
Methodological Issues dimension is discussed in the context of each 
of the Research Issues levels. A recommendation from this work is 
that ITS evaluation studies should adopt multi-dimensional, 
multi-method designs. (Author) 

PREFACE 



The mission of the Intelligent Systems Branch of the Training Systems Division of the Air
Force Human Resources Laboratory (AFHRL/IDI) is to design, develop, and evaluate the
application of artificial intelligence (AI) technologies to computer-assisted training systems.
The current effort was undertaken as part of IDI's research on intelligent tutoring systems
(ITSs), ITS development tools, and intelligent computer-assisted training testbeds. The work
was accomplished under Work Unit 1121-09-29, Intelligent Training Worlds.

An earlier version of this paper was presented at the Annual Meeting of the American
Educational Research Association in San Francisco, CA, 31 March 1989. We would like to
thank Drs. Joe Psotka (Army Research Institute) and Valerie Shute (AFHRL/MOE) for their
comments on that paper.
INTELLIGENT TUTORING SYSTEMS:
A TAXONOMY OF EVALUATION ISSUES 



I. INTRODUCTION 



Evaluation is the process of applying 'scientific procedures' to collect 'reliable and valid information' to
make 'decisions' about an 'educational program' (Berk, 1981, p. 4).

The goal of this paper is to list evaluation issues in an applied setting and to propose a taxonomy which
structures those issues. With the recent advances in the field of Intelligent Tutoring Systems (ITSs), it is time
to compile the issues that are important to the products of those advances. Most issues described in this
paper are substantive or "research" in nature.



In our particular setting, three groups can be involved in the design, development, evaluation, and
implementation of an ITS (or tools created and used in the development of an ITS) (see Figure 1). These are
a contractor, our research laboratory, and the Air Force (AF) training community. In the building of an ITS,
a contractor (many times a university) is responsible for the design and development or coding of the ITS
according to the specifications set by laboratory personnel. Upon completion, the contractor delivers or
transitions the completed product to the research laboratory. Laboratory personnel are then responsible for
evaluating and demonstrating the ITS to the AF training community. The latter must decide whether or not
to support the continued development and ultimately the operationalization of the ITS.



[Figure 1: matrix of activities (Design, Develop, Evaluate, Implement) by agent (Contractor, Laboratory,
AF Training Community); the cell marks did not survive extraction.]

Figure 1. ITS Development Roles.



Note: X denotes typical participation; ? denotes occasional participation.



While the above scenario is generally true, the roles of the three agents vary from project to project. In
some situations the laboratory and the contractor share the responsibility for the design and development of
the ITS. In addition, the contractor and the AF training community may participate in the
evaluation of the tutoring system. Nonetheless, in any of the variations of roles, the contractor delivers the
ITS to our research laboratory and then to the AF training community.

One important characteristic of this setting is that the developers, evaluators, and decision-makers may 
not be the same agents throughout the development and implementation of a tutoring system. In many 
cases, the contractor plays the role of the developer, the research laboratory personnel are the evaluators 
and the AF training community makes the decisions concerning adoption of the training systems being 
developed. This separation requires a more extensive evaluation methodology than when one agency plays 
all three roles.



Consequently, the evaluators must not only determine the effectiveness of the system (i.e., do the learners
actually learn something), but also must evaluate the system on several other dimensions. For instance, the
evaluator must assess whether the tutor meets design requirements. This may include assessing the
functionality of the system (e.g., does the tutor perform in the manner specified in the design documents?),
its effectiveness (e.g., do learners learn?), and its efficiency (i.e., is it cost beneficial?).

The evaluation of the tutor must be complete enough for decision-makers to determine its value and
relevance to their needs. Demonstrating that students learn within the tutoring environment is not adequate
for the AF training community to support advanced development and implementation of the tutor. The
evaluators must be able to show that the tutor increases on-the-job performance and is cost efficient. AF
training personnel, while interested in gains in performance on the tutor, are much more interested in data
that clearly show an improvement in job performance, the ability to learn on the job, or increased motivation
to learn more on the job. Simple learning or performance gains within the tutoring environment in a laboratory
setting, while necessary, are not sufficient for training decision-makers.

As a result of this setting in which the developers, evaluators, and decision-makers are not the same
individuals, the nature of evaluation of the system must be more comprehensive than traditional evaluations
that address the single, large question, "Is the ITS effective?"



II. TAXONOMY OF ITS EVALUATION

We are proposing a taxonomy for the evaluation of ITSs that has three inter-related dimensions: Phase
or Life Cycle of Evaluation, Research Issues, and Methodological Issues (see Figure 2). The Life Cycle
dimension refers to a sequence of evaluation from initial pre-experimental studies, to laboratory and field
studies, and finally to research on the implementation in the actual training setting. The second dimension,
Research Issues, covers the wide range of substantive issues researchers address concerning a system's
functionality, effectiveness, and cost. The final dimension, Methodological Issues, includes issues that must
be addressed in planning and conducting evaluation studies. Each of these dimensions is elaborated below.
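One way to make the three dimensions concrete is to enumerate them and cross them into cells. The following is a minimal sketch, not part of the original taxonomy: the class and function names are hypothetical, and the five methodological levels are assumed from the list this paper gives later (subjects, design, variables, instruments, procedures).

```python
from enum import Enum
from itertools import product


class LifeCycle(Enum):
    """Phases, ordered from most experimental control to most operational realism."""
    PRE_EXPERIMENTAL = 1
    LABORATORY = 2
    FIELD = 3
    IOTE = 4  # Initial Operational Test and Evaluation


class ResearchIssue(Enum):
    FUNCTIONALITY = "functionality"
    EFFECTIVENESS = "effectiveness"
    COST = "cost"


class MethodologicalIssue(Enum):
    # Assumed levels, taken from the paper's later list of methodological issues.
    SUBJECTS = "nature of the subjects"
    DESIGN = "design of the study"
    VARIABLES = "independent and dependent variables"
    INSTRUMENTS = "instruments"
    PROCEDURES = "procedures"


def taxonomy_cells():
    """Return every (phase, research issue, methodological issue) combination."""
    return list(product(LifeCycle, ResearchIssue, MethodologicalIssue))


def planning_checklist(phase, issue):
    """List the methodological decisions to settle for one phase/issue pair."""
    return [
        f"{phase.name} / {issue.value}: decide {m.value}"
        for m in MethodologicalIssue
    ]
```

Under these assumptions the taxonomy has 4 x 3 x 5 = 60 cells, and a call such as `planning_checklist(LifeCycle.FIELD, ResearchIssue.COST)` yields the five methodological decisions for a field study of cost.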




Figure 2. Taxonomy of Evaluation Issues. (Axes: Life Cycle of Evaluation, Research Issues, and
Methodological Issues.)

The phase or life cycle of evaluation consists of levels which cover the flow of evaluation studies from
pre-experimental evaluations through a series of studies in which experimental control decreases while
operational qualities increase (see Figure 3). At the earliest point of an evaluation cycle, pre-experimental
studies determine characteristics of the software, changes in student knowledge and skills at a detailed level,
and developmental costs using no or few subjects (i.e., pilot subjects). In some cases, subject matter experts
(SMEs) review the software for its accuracy and functionality. Laboratory and field studies determine issues
such as the instructional effectiveness of the ITS at a larger scale than the pre-experimental study. While
laboratory studies allow a high degree of experimental control of extraneous variables, they may not provide
much information about the application of the ITS in a realistic work place context. Field studies may provide
the latter, but they sacrifice experimental control of independent and extraneous variables. These studies
also address issues concerning functionality and costs. The final phase is the Initial Operational Test and
Evaluation (IOT&E) of the ITS. In this type of study, the system under investigation is actually implemented
for a period of time (e.g., several months) in the same manner intended when it is put in an operational
environment. This study collects the data used to finally decide on changes to the system before continuing
with its use in an operational setting.




Figure 3. Life Cycle Dimension. (Levels: pre-experimental, laboratory, field, and IOT&E.)



The second and third dimensions of the taxonomy, Research Issues and Methodological Issues, directly
affect the nature of the evaluation. The second dimension, Research Issues, consists of the three broad
categories of ITS assessment: functionality, effectiveness, and cost (see Figure 4). Standard experimental
textbooks address the methodological issues: nature of the subjects, design of the study, independent and
dependent variables, instruments, and procedures. The remaining portion of this paper focuses
primarily on the Research Issues dimension and secondarily on the Methodological Issues dimension.
Figure 4. Research and Methodological Issues.
attention to the nature of the subjects used in a pre-experimental study of an ITS's functional capabilities
while paying more attention to the instruments and dependent variables.



III. CATEGORIES OF RESEARCH ISSUES

In our applied setting, we raise three categories of issues: functionality (What does the ITS do? Does the
ITS do what it should do?), effectiveness (Is the ITS effective?), and cost/benefit (Is the ITS efficient?). To
explore each research issue within these broad categories, methodological issues concerning (a) design of
the evaluation study, (b) types of data to collect, and (c) methods of data collection must be addressed. The
following sections of the paper briefly describe several further research issues within each of these
categories.



Functionality Issues 

Completeness of Code 

The issue here is whether the developer delivered all source code, compiled code, and associated 
programming documentation. This is necessary to ensure that the tutor is executable, designed appropriately,
well-documented, and modifiable or extendible for further development. As with all software, bugs should be 
expected and the only way to fix them is with access to the source code and proper documentation. This is 
not always trivial. 



Requirements 

This issue is whether the developed system meets two types of requirements set forth at the beginning
of the ITS development. One is functional specifications and the other is performance. Functional
specifications describe how the tutor should behave. They are described at the beginning of the research
effort and can include requirements for, among other characteristics, the system's human-computer
interaction, instructional approach, and hardware. Evaluation of this type, then, involves comparing the
design of the ITS to the functional specifications prescribed before development. The performance
issue is whether or not the implemented system as a whole meets those requirements. For instance, the
functional specifications and design may include a context-sensitive help facility. Performance evaluation
would assess the degree to which a context-sensitive help facility was indeed implemented. This issue must
be addressed for contract evaluation, in prototyping new systems, and for subsequent enhancement of the
system.



Relation to Taxonomy of Learning Environments 

ITSs vary greatly in how they interact with students, the structure of the curriculum, the types of knowledge
students will be learning, and so on. Kyllonen and Shute (1988) proposed a taxonomy of learning
environments for describing and classifying ITSs. The four proposed dimensions are: knowledge type,
instructional environment, domain, and learning style. The knowledge type dimension includes declarative
knowledge (knowing that), procedural knowledge (knowing how to perform a task), and mental model
(knowledge of the causal relations within a domain). The instructional environment dimension classifies
the instructional approach embodied in the ITS. Examples include Learning by Analogy, Learning from
Instruction, and Inductive Learning. Domain, the third dimension described by Kyllonen and Shute (1988),
represents dimensions underlying domain-specific learning. Domains vary in the degree to which technical,
quantitative, qualitative, and verbal ability play a role in competent performance. The final dimension of the
learning taxonomy is Learning Style. It covers the learner's characteristics that influence instructional
activities and in turn can be modified through instruction. While we direct the reader to Kyllonen and Shute
(1988) for further details, suffice it to say that a taxonomy such as the one described here would have
implications for evaluations of ITSs. 

Adopting a taxonomy and evaluating a tutor relative to it has both scientific and practical benefits (Kyllonen 
& Shute, 1988). On the practical side, a taxonomy might provide information to decision makers about the 
applicability of a specific tutor to the problem at hand. A taxonomy might also suggest ways to describe the 
nature of the material taught in the domain (e.g., predominantly procedural). A third way a taxonomy might
be helpful is in specifying an appropriate instructional approach given the domain. To realize these benefits,
evaluators must plan and conduct studies which collect data pertinent to the dimensions and their levels of
the adopted taxonomy. 

Furthermore, as the ITS field grows and matures, review (e.g., meta-analysis) studies will compare results 
and synthesize conclusions. These reviews will be highly dependent on the completeness of and accuracy 
in the descriptions of the student samples, educational treatments, data collection procedures and so on 
(Abrami, Cohen, & d'Apollonia, 1988). Adopting a taxonomy will then facilitate the accuracy and subsequent
usefulness of those reports not only for decision makers in the training community, but also for the scientific
community. 

The scientific community would also benefit from a learning environment taxonomy in forming research
hypotheses and guiding evaluative research. Research questions could be raised about each cell of the
proposed taxonomy, such as: "Is an expository approach appropriate for students with strong
perceptions of self-competence in the domain of air traffic control?" Furthermore, a taxonomy could generate
research questions concerned with the efficacy of different design approaches of ITS components (see below)
given the cells of the taxonomy. For instance, how rich or intensive does the student modelling approach
need to be given a particular instructional approach, domain, and type of knowledge taught?



ITS Components 

Several research issues surround the evaluation of the functionality and design of ITS components. It is
not sufficient to ask only if the ITS as a whole is effective or whether ITS components are effective.
We must also address whether ITS components function in accordance with design specifications or human
performance in the real-world setting. Each component and a few issues are presented.

One component of an ITS is the Expert Model. It represents the domain related knowledge that experts
possess about tasks that are performed, problem solving techniques and strategies, equipment, and experts'
reasoning (Anderson, 1988). An important question that must be addressed is whether the representation
of the knowledge and reasoning skills is appropriate, accurate, and complete given the domain. Relatedly,
how to verify the veracity of the representation is also critical (a methodological issue).

Another component of an ITS is the Student Model. It represents the characteristics of each student and
is dynamically updated based on a student's performance during the course of instruction (VanLehn, 1988).
Questions addressed here include: Is the representation of the student employed in the student model
detailed and complete enough for capturing a student's strengths and weaknesses relative to the domain?
How intensive does the student model need to be given the nature of the domain? How should we evaluate
the appropriateness of the representation of the student in the student model? Its evaluation is critical
because it plays an important role in diagnosing a student's needs and individualizing instruction.

A third component of an ITS is the Instructional Model. It is responsible for comparing a student's
performance relative to expert performance, developing an instructional plan based on the student's needs
and abilities, and delivering that instruction (Halff, 1988). This capacity to dynamically adapt instruction to
the individual is one of the greatest advantages of ITSs over more traditional computer-based training (CBT)
or classroom training where the student-teacher ratio is high. Evaluation studies must be able to assess to
what degree the system under evaluation actually accomplishes this. ITSs have the potential to individualize
instruction through context-dependent explanations, remediate after failure, coach a floundering student,
present instruction based on the student's learning style, monitor the amount of time remaining for the lesson,
and respond to student requests. Evaluation studies need to further explicate the issues centering around
individualization and how to assess the extent to which an ITS individualizes instruction.

Studies should also evaluate the relationship of the instructional approach embodied in the tutoring
system to theories and principles of learning and instruction. ITSs intentionally or by default adopt a particular
instructional approach or approaches in tutoring students. They also vary in the degree to which they adopt
those teaching strategies or techniques. Evaluation studies could then determine whether an ITS follows a
well-founded instructional theory and to what extent. Furthermore, this kind of analysis could provide
information for improving instructional theory and practice underlying ITSs.

Evaluation studies of the fourth component of an ITS, the Interface, have traditionally addressed issues
of user acceptance. In an actual training environment, the "user" could be the student or the
instructor/administrator. Studies of student acceptance have investigated the computer-to-student flow of
information (e.g., the student's ability to understand directions and explanations) and the student-to-computer
flow (e.g., menuing). For example, Williams, Hamel, and Shrestha (1987) have constructed a checklist for
evaluating computer-assisted instruction (CAI) interfaces.

Interfaces also must be acceptable to instructors/administrators. Evaluation studies should determine
if instructors consider the tutor (a) easy to use, (b) easy to learn, and (c) easy to teach to students. If tools
such as authoring or management tools are available, evaluators must also determine how easy each is to learn,
understand, and use.



Evaluating an interface must go beyond assessing traditional "acceptance." Frye, Littman, and Soloway
(1988) found that inexperienced users (in this case children) had more problems operating the programs
than older, more experienced children. Difficulties in using the programs reduced the students' access to
the educational content, thereby reducing the overall instructional effectiveness of the tutor. Frye, Littman,
and Soloway (1988) pointed out that not only was contact with the instructional content reduced, but also
that the interface directly interfered with students' understanding of the content. This example points to the
need for evaluation of interfaces beyond that of surveys or ratings of user acceptance.



Instructional Context 



Implementing an ITS in an actual classroom context can have profound impacts on that context. Not 
only may student and teacher roles change, but also the student-teacher interaction may change 
(Zimmerman, Smith, Bastone, & Friend, 1989). In addition to adjusting to changing roles, the teacher must 
be able to integrate the ITS into the existing curriculum and daily and weekly schedules. There may also be 
physical characteristics of the instructional context that must be taken into account, such as hardware and 
the arrangement of the room. It is especially important to measure these potential changes in the instructional 
context during the infancy of ITS implementations. Early findings could lead to subsequent research 
addressing instructional context characteristics that facilitate or hinder the final implementation of tutoring
systems.



Methodological Issues 

After deciding what features of an ITS to evaluate, researchers must address several methodological 
issues. One set involves the design of the study. Some research questions lend themselves to experimental 
comparison; others require interviews with domain experts. The design of the study is affected not only by
the research question, but also by limitations in resources allocated to the evaluation. For instance, since domain
experts are scarce and in large demand, it is not feasible to have 20 domain experts review the representation
of expert reasoning. 
Data Collection. Another set of methodological issues involves the type of data to collect. Steinberg
(1984) gives an excellent enumeration of data important to this issue of functionality. She states that the data
should revolve around the accuracy and completeness of the content, an expert instructor's opinion of the
method of presentation, technical flaws, flow of the lesson, time required to complete a session, and students'
attitude toward the tutor. She recommends keeping a computer file of students' keystrokes for analysis of
such questions as:

What proportion of keystrokes or clicks are erroneous? 

How long do students spend on each part of the tutor? 

How many times do students press the help key? 

What is the number and nature of unanticipated keystrokes? 

Another computer file she recommends is one that allows students to immediately enter comments or
recommendations about the tutor.
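The four keystroke questions above can be answered from a simple event log. The sketch below is illustrative only: the `KeyEvent` record layout, the `"HELP"` key label, and the rule that the time between two consecutive events is charged to the earlier event's section are all assumptions, not part of Steinberg's (1984) recommendation.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class KeyEvent:
    """One logged keystroke or click (hypothetical record layout)."""
    section: str       # part of the tutor the student was in
    key: str           # key or click target
    seconds: float     # time since the session started
    erroneous: bool    # judged an error by the tutor or a rater
    anticipated: bool  # whether the designers anticipated this input


def summarize(events, help_key="HELP"):
    """Summarize one student's session log, assumed sorted by time."""
    total = len(events)
    errors = sum(e.erroneous for e in events)

    # Charge the gap between consecutive events to the earlier event's section.
    time_per_section = defaultdict(float)
    for prev, cur in zip(events, events[1:]):
        time_per_section[prev.section] += cur.seconds - prev.seconds

    return {
        "error_proportion": errors / total if total else 0.0,
        "time_per_section": dict(time_per_section),
        "help_presses": sum(e.key == help_key for e in events),
        "unanticipated": dict(Counter(e.key for e in events if not e.anticipated)),
    }
```

The time spent after the final logged event is not counted, so a real log would need an explicit end-of-session record.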

Data collected by unobtrusive observers are also significant. During student trials, observers can determine
such information as the keyboard/mouse manipulation requirements imposed by the tutor, the readability of
screens, clarity of instructions, and the tutor's technical correctness.

Steinberg (1984) also recommends that observers interview tutored students to collect valuable feedback
data. In these interviews, students are asked their overall opinion, what they consider the best and worst parts,
recommendations, and clarification of any notes the observer makes during the session. Furthermore,
Kyllonen and Shute (1988) list 29 indicators of a student's progress in an ITS relating to activity level and
exploratory behaviors, data recording, use of embedded tools, effective generalizations of principles, and
effective experimental behaviors.

Instruments. Issues concerning how to collect the data are directly tied to decisions concerning what
data to collect. Instruments include, but are not limited to, the ITS itself, verbal protocols, videotaping,
checklists, rating scales, technical analysis of code, and interviews with students, experts, instructors, and
administrators.



Effectiveness Issues 

The most important information that researchers can collect, summarize, and present to decision-makers
concerns the effectiveness of the tutoring system. Other issues, such as the functionality of ITS components,
are not of consequence if evaluations of a tutor show that the ITS is not effective in producing student gains.
Although evaluating functionality is relatively straightforward, determining the effectiveness of a tutoring
system is not. Decisions must be made about the design of the study (e.g., experimental, longitudinal), what
constitutes an appropriate control group, controlling or measuring access to the tutoring environment,
measuring access to the curriculum, and measuring the effects on student performance and motivation to
continue learning and performing.



Access to Learning Environment 



Since one goal of an evaluation may be to determine what ITS component or components affect student
performance, access to the learning environment must be controlled or measured and analyzed. This type
of access can be thought of in terms of total time allocated for training or in terms of the quality of that time
(e.g., uninterrupted blocks of time). If the ITS group receives more learning opportunities than the control
group (e.g., On-the-Job Training (OJT)), then the comparison is not one of effectiveness of an ITS compared



Access to the Curriculum 



Training systems, in general, vary in the representation of domain-relevant information, such as heuristics,
algorithms, concepts, and devices. Training systems also vary in the ways students access that information.
For example, in traditional classrooms students access the curriculum through books and lectures. ITSs now
provide unique ways for students to come into contact with the domain due to their ability to present
information in time-compressed methods and through their ability to represent the knowledge and skills from
several individuals who could not be present for training purposes (e.g., domain experts). Evaluation studies
should address access to the curriculum as part of a training system.

ITSs can present more of a curriculum to each student due to their ability to deliver time-condensed training.
An example is a microworld named Orbital Mechanics (OM). The goal of OM is to develop in the student
an understanding of the relationship between several numerical parameters and the visualization of the
ground trace of a satellite. It takes a student about 2 to 5 hours to perform all the equations underlying the
orbit by hand. In OM, it takes about 5 to 10 seconds, because the equations are embedded in the microworld.
Thus, a student can "access" more of the curriculum due to the time-compressed delivery capabilities.
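The size of this time compression follows directly from the figures quoted above. The short sketch below works through the arithmetic; the function name is hypothetical, and the result is an upper bound on curriculum access because it ignores the time a student spends studying each ground trace.

```python
def compression_ratio(manual_hours, tutor_seconds):
    """Ratio of hand-computation time to microworld time for one ground trace."""
    return manual_hours * 3600 / tutor_seconds


# Least favorable pairing: fastest hand computation, slowest microworld response.
low = compression_ratio(manual_hours=2, tutor_seconds=10)   # 720.0

# Most favorable pairing: slowest hand computation, fastest microworld response.
high = compression_ratio(manual_hours=5, tutor_seconds=5)   # 3600.0
```

On these figures, OM computes a ground trace somewhere between 720 and 3,600 times faster than a student working by hand.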

Another way in which access to the curriculum may be increased is through the representation of an
expert in the ITS. In the work place environment, experts are not plentiful and do not have time to train novices
how to solve domain problems. By embedding expert knowledge and reasoning in the ITS, more novices
can have access to expert thinking. As a result, more students can be trained without allocating expert
resources to the training process beyond initial development of the tutor. This increased access to expert
reasoning, and hence the curriculum, could account for the differences between ITS applications and
alternate training approaches.

One advantage of measuring access to the curriculum is that it gives decision makers additional
information about the potential of ITSs in general. The empirical demonstration that ITSs can deliver more
instruction in the same amount of time or the same instruction in less time provides additional important
information about the benefits of adopting an ITS approach.



Learning Indicators 

By far, the single most significant finding of all evaluation is whether or not ITSs increase a student's
knowledge, skills, and strategies in the target domain. Studies designed to answer this "grand" question
must, therefore, gather valid indicators of learning. These indicators can be collected prior to a student
entering instruction, during instruction, and after instruction. Not only can we collect data on changes in
domain related knowledge, skills, and strategies, but also on other more subtle indicators revealed in the
dynamics of the student's interaction with the learning environment. For instance, changes in the pattern of
a student's help requests may indicate growing cognitive structures. Other indicators include such measures
as latencies in interacting with the learning environment, menu selection, and responses to tutor advice or
directives. Kyllonen and Shute (1988) described the impact of an ITS on students' learning by collecting data
on 29 learning indicators within three broad groups: activity and exploratory behaviors, data management
skills, and thinking and planning skills.

Posttest data is the primary data of interest when one is evaluating an ITS as a whole. This can be done
by comparing it to some other educational system (e.g., traditional education) or by assessing changes in
the level of knowledge, skills, and strategies of individual students. However, posttest and concurrent data,
which can be collected unobtrusively, can be used when assessing ITS component effectiveness. This latter
form of evaluation can be done by comparing the ITS in question to another computerized instructional system
or by analyzing changes in students' performance profiles.
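Both comparisons described here reduce to simple arithmetic on pretest and posttest scores. The following is a minimal sketch under stated assumptions: the function names are hypothetical, each student contributes one pretest and one posttest score, and no significance testing is attempted.

```python
from statistics import mean


def gains(pre, post):
    """Per-student change from pretest to posttest (scores paired by position)."""
    if len(pre) != len(post):
        raise ValueError("each student needs both a pretest and a posttest score")
    return [after - before for before, after in zip(pre, post)]


def gain_advantage(its_pre, its_post, other_pre, other_post):
    """Mean gain of the ITS group minus mean gain of the comparison group."""
    return mean(gains(its_pre, its_post)) - mean(gains(other_pre, other_post))
```

A positive `gain_advantage` indicates larger average gains for the ITS group; whether the difference is reliable is a separate question that depends on the design of the study and the sample size.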
Job Performance



One issue that training system decision makers want addressed is the training system's ability to improve
on-the-job learning and performance. While researchers may get excited about gains in student performance
within a tutoring environment, the operational training community needs evidence that ITSs improve actual
job performance. This requires that data outside of the ITS environment (e.g., troubleshooting ability) be
collected. The requirement to gauge the impact of training with ITSs on job performance is essentially one
of measuring transfer of training. Cormier and Hagman (1987) give an excellent treatment of this issue. In
Chapter Nine, "Measuring Transfer in Military Settings," Boldovici (1987) notes that training with devices can have
positive, negative, and neutral effects on job performance.

The usual experimental design for measuring transfer is to first train two groups with different methods
and then measure differences in their job performance. To reduce sources of error frequently encountered
in this design, Boldovici suggests one in which job performance of three groups is measured after three
training intervals. The groups are: (a) the device group which receives training with the device (ITS in our
case), (b) the conventional group which receives conventional training without the ITS, and (c) the control
group which receives no training. Tests are given to each group at three equal intervals during the period
that is usual for conventional training. Table 1 depicts this design.



Table 1. Schedule for a Transfer Experiment 

                                                  Weeks
                               1,2,3,4             5,6,7,8             9,10,11,12
Device group          Test     Train      Test     Train      Test     Train        Test
                      Task B   Task A     Task B   Task A     Task B   Task A       Task B

Conventional group    Test     Train      Test     Train      Test     Train        Test
                      Task B   Task A     Task B   Task A     Task B   Task A       Task B

Control group         Test                Test                Test                  Test
                      Task B              Task B              Task B                Task B


Boldovici's design has several advantages. It separates the amount of training from the effects of training 
media by treating amount of training as an investigated effect. It also allows the reliability of the Task B 
measurements to be inspected. Furthermore, causal linkages between learning Task A and performing Task B 
can be made. 
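
As a rough illustration of how data from a three-group, repeated-test design of this kind might be summarized, the Python sketch below computes each trained group's mean Task B score at every test point and its difference from the control group. The scores are invented placeholders, not results from this report or from Boldovici (1987), and the names `scores`, `group_means`, and `gain_over_control` are hypothetical.

```python
from statistics import mean

# Hypothetical Task B scores (placeholder values only, not real data).
# Each group has four test points; each test point lists individual scores.
scores = {
    "device":       [[50, 54], [62, 66], [70, 74], [78, 82]],
    "conventional": [[51, 53], [58, 62], [64, 68], [70, 74]],
    "control":      [[50, 52], [51, 53], [50, 54], [52, 52]],
}


def group_means(tests):
    """Return the mean Task B score at each test point for one group."""
    return [mean(test) for test in tests]


def gain_over_control(group, control):
    """Return the trained group's mean minus the control mean at each test point."""
    return [g - c for g, c in zip(group_means(group), group_means(control))]


device_gain = gain_over_control(scores["device"], scores["control"])
conventional_gain = gain_over_control(scores["conventional"], scores["control"])
```

Comparing the two lists test by test shows how the advantage of each training medium changes with the amount of training, which is the effect the design is meant to expose. A real analysis would also need significance tests and reliability estimates, which this sketch omits.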

Motivation to Learn and Perform 

Not only do new training systems affect cognitive variables (e.g., domain knowledge), but they influence 
students' motivation and attitudes. According to Bandura (1982) and Schunk (1984), positive experiences 
in a learning situation lead to the development of positive self-efficacy. This in turn leads to increases in 
willingness to learn more, willingness to take risks, and willingness to persist in the face of failure. Given this 
perspective, studies of ITSs should evaluate the effects on motivation not only to perform, but also to learn 
more. 

ITS Components 

In the early stages of development of ITS technology, evaluative studies should assess the effectiveness of 
various approaches to ITS components. ITSs have the ability to record data about each student's activities 
and performance, the instructional events that occur, and the relationship between the two. ITSs are data-rich 
environments: they can assess the effectiveness of components not only in isolation, but also in complex 
interactions. For instance, in one domain it may not be necessary to have an elaborate, intensive model of 
student knowledge and skills, but in a different domain a detailed representation of the student's abilities, 
misconceptions, and performance may be required for the ITS to be effective. 
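
To make the idea of relating instructional events to student performance concrete, the following Python sketch tallies how often a student's next response is correct after each type of instructional event. The log format, the event names, and the function `success_rate_by_event` are all assumptions made for illustration; the report does not specify how an ITS stores these data.

```python
from collections import defaultdict

# Hypothetical log entries: (student_id, instructional_event, next_response_correct).
# The values are placeholders, not data from any actual ITS.
log = [
    ("s1", "hint", True),
    ("s1", "worked_example", True),
    ("s2", "hint", False),
    ("s2", "hint", True),
]


def success_rate_by_event(records):
    """Return, per event type, the fraction of following responses that were correct."""
    counts = defaultdict(lambda: [0, 0])  # event -> [number correct, number total]
    for _student, event, correct in records:
        counts[event][0] += int(correct)
        counts[event][1] += 1
    return {event: n_correct / n_total for event, (n_correct, n_total) in counts.items()}


rates = success_rate_by_event(log)
```

A summary of this kind addresses one component in isolation; studying interactions among components would require recording which component configurations were active for each event.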
Methodological Issues 



As described under the section on functionality issues, evaluators of ITSs must address methodological 
issues concerning subjects, design, variables, instruments, and procedures before conducting evaluative 
studies. A discussion of three important methodological issues follows. 

Comparison group. The choice of a comparison group is tied directly to the specific question addressed. 
If the goal is to draw conclusions about a tutoring system's effectiveness relative to an extant educational 
system, then that extant group can serve as the comparison group. In contrast, if the goal is to determine 
what components or functions make a tutor effective, then the extant group can serve as a comparison group 
only if controls are placed in the extant learning environment that limit the influence of extraneous variables. 
For instance, if the goal is to determine whether an ITS is effective due to the individualization of instruction, 
then control of other variables, such as the quantity and quality of the curriculum, must occur to guarantee 
the equivalence of the two groups. Table 2 presents the dissimilarities of three training environments in the 
Air Force. Because of the vast differences, extant educational systems should be used for comparison only 
when the goal is to show differences in educational systems at the system level. Other approaches are 
needed, such as monitoring changes in the student model as a result of instructional events, if the goal of 
the evaluation is to determine the effectiveness of specific ITS component approaches. 



Table 2. Stylized Description of Three Educational Systems 

Educational                       Instructional         Agent of
system           Curriculum       materials             delivery

Technical        Structured       Texts, Notes,         Instructor
School                            Lectures

On-the-job       Incomplete,      Manuals,              Expert
Training         Fragmented       Actual Equipment

ITS              Structured       Problems              Interface
                                  Text                  Module



Instruments. Several techniques have been used to collect data in the evaluation of instructional systems. 
The most prominent is to have student performance data collected by the computer or via external measures 
such as paper and pencil. Others have used measures which reflect actual job-related performance, verbal 
protocols (both concurrent and retrospective), interviews, surveys, audio and video recordings, and direct 
observations. 



Cost Issues 

The third level under the Research Issues dimension of our proposed taxonomy of evaluation is cost. To 
evaluate fully an ITS for potential implementation, data must be collected not only on the cost of 
development, but also on the cost of evaluation of the ITS, initial implementation in the operational environment, 
and maintenance and updating once the ITS is operational. Development costs include the time and dollars 
spent on and by knowledge engineers, subject-matter experts, instructional developers, and computer 
programmers. Evaluation costs are not trivial in an applied setting and might easily be 
overlooked. Evaluation costs could include travel expenses, subject-matter expert time, student time, and 
evaluator time. Decision makers in the training community need estimates of implementation costs for 
accurate planning and budgeting. Implementation costs include those related to course instructor training, 
hardware requirements, and software needs (e.g., knowledge engineering and authoring tools). Decision 
makers also need to know potential maintenance costs for the hardware and software once the instructional 
system is in place. 
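
The four cost categories named above can be combined into a simple life-cycle estimate. The Python sketch below is only an illustration under stated assumptions: the class `ITSCosts`, its field names, and the idea of charging maintenance at a flat yearly rate are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass


@dataclass
class ITSCosts:
    """Hypothetical cost record for one ITS, in any consistent currency unit."""
    development: float            # knowledge engineers, experts, developers, programmers
    evaluation: float             # travel, expert time, student time, evaluator time
    implementation: float         # instructor training, hardware, software
    maintenance_per_year: float   # assumed flat yearly upkeep of hardware and software

    def total(self, years_in_service: int) -> float:
        """Return the estimated total cost over the given number of service years."""
        one_time = self.development + self.evaluation + self.implementation
        return one_time + self.maintenance_per_year * years_in_service


# Placeholder figures chosen only to show the arithmetic.
example = ITSCosts(development=100.0, evaluation=20.0,
                   implementation=30.0, maintenance_per_year=5.0)
estimate = example.total(years_in_service=4)
```

Keeping evaluation as its own field reflects the point that these costs are easy to overlook when only development and implementation are budgeted.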
IV. CONCLUSIONS AND RECOMMENDATIONS 



As with much in the worlds of education and computers, developers are hurriedly building ITSs for real-world 
applications. We see this as an expensive but positive step. To reduce costs and facilitate the 
proliferation of this technology, researchers must learn more about the effectiveness of ITSs and 
use those findings to produce better tutors. This can best be done through an organized effort by the 
research community to evaluate these systems by constructing and applying experimental paradigms which 
address the issues mentioned in this paper. 

We offer several recommendations: 

1. Adopt a taxonomy of learning environments for an efficient, comprehensive description of ITS 
functionality. 

2. Adopt multi-method evaluation methodologies. 

3. Describe evaluation studies fully. 

4. Find out in as much detail as possible what potential users of ITSs require--get to know the users in 
more than a clinical sense. 

5. For evaluating effectiveness, especially for simulation-based ITSs, consider the experimental design 
proposed by Boldovici. 

6. Create a taxonomy of effective designs of ITS components for different domains. 